Advancing material property prediction: using physics-informed machine learning models for viscosity

In materials science, accurately computing properties like viscosity, melting point, and glass transition temperatures solely through physics-based models is challenging. Data-driven machine learning (ML) also poses challenges in constructing ML models, especially in the material science domain where data is limited. To address this, we integrate physics-informed descriptors from molecular dynamics (MD) simulations to enhance the accuracy and interpretability of ML models. Our current study focuses on accurately predicting viscosity in liquid systems using MD descriptors. In this work, we curated a comprehensive dataset of over 4000 small organic molecules’ viscosities from scientific literature, publications, and online databases. This dataset enabled us to develop quantitative structure–property relationships (QSPR) consisting of descriptor-based and graph neural network models to predict temperature-dependent viscosities for a wide range of viscosities. The QSPR models reveal that including MD descriptors improves the prediction of experimental viscosities, particularly at the small data set scale of fewer than a thousand data points. Furthermore, feature importance tools reveal that intermolecular interactions captured by MD descriptors are most important for viscosity predictions. Finally, the QSPR models can accurately capture the inverse relationship between viscosity and temperature for six battery-relevant solvents, some of which were not included in the original data set. Our research highlights the effectiveness of incorporating MD descriptors into QSPR models, which leads to improved accuracy for properties that are difficult to predict when using physics-based models alone or when limited data is available. Graphical Abstract Supplementary Information The online version contains supplementary material available at 10.1186/s13321-024-00820-5.


S1.1. Sources
Table S1 summarizes the sources of viscosity that were extracted to generate the 4,440 curated dataset.
Table S1: Literature sources used to extract temperature-dependent viscosities.This table summarizes the number of instances, unique molecules, experimental temperature (T exp ) range, viscosity range, and logviscosity range for each source.Sources are listed in descending order relative to the number of instances.

S1.2. Additional dataset analysis
We further analyzed the viscosity dataset to see how well our curated dataset accounts for temperature information within the model.Figure S1 shows a histogram of the number of temperature examples per molecule for the curated viscosity dataset.Of the entire dataset, 478 unique molecules have a single temperature-viscosity example, whereas the remaining 527 unique molecules have multiple temperature-viscosity examples.The histogram shows a general decrease in molecules with multiple temperature-viscosity examples with molecules that can have up to ∼40 temperature-viscosity examples.In sum, the dataset has a balance of both chemical heterogeneity and temperature dependence of viscosity, which are both important for developing generalizable quantitative structure-property relationships (QSPR) models for predicting viscosity.

S2.1. Molecular dynamics descriptors
We computed eight descriptors from molecular dynamics (MD) simulations: 1. Packing density (MD density): Calculated by taking the average cell size of the last 20% of a stage from the equilibration protocol which includes 10 ns NPT at T exp and pressure of 1 atm.MD density has units of g/cm 3 .

Percentage free volume (MD FV):
Extracted by taking the free volume percentage from the output of the production run, which includes 20 ns NVT at T exp and pressure of 1 atm.We used the spacing between grid points as 0.25 Å for the free volume calculation.MD FV is unitless.
3. Radius of gyration of the molecule (MD Rg): Calculated by taking the average Rg of all molecules across all the snapshots in the last 10 ns of the production run, which is 100 snapshots in total.MD Rg has units of Å.
4. Heat of vaporization (MD HV): Calculated from the energy of the periodic unit cell (E cell ) minus the sum of the N individual molecules, E i , averaged over the last 10 ns of production NVT trajectory, as R is a gas constant with a value of 1.9872036 × 10 −3 kcal K −1 mol −1 , and T exp is the experimental temperature.MD HV has units of kcal/mol.

Hansen solubility parameters (MD SP, MD SP E, and MD SP V):
The solubility parameter (MD SP) is calculated using the equation: where ∆H V and V M is the molar volume.We also compute MD SP E and MD SP V by taking the van der Waals (vdW) and electrostatic contributions to the above equation, respectively.The values are averaged over the last 10 ns of the production NVT trajectory.MD SP, MD SP E, and MD SP V have units of MPa (1/2)   6. Root-mean-square displacement (MD RMSD): We compute the root-mean-square displacement of all the molecules over the last 10 ns of production NVT trajectory.MD RMSD has units of Å.

S2.2. Descriptor correlation to viscosity
We investigated whether the top descriptors as observed in Fig. 5 of the main text are correlated to viscosity.Fig. S2 shows a scatter plot between log viscosity versus top descriptors for the entire dataset with Pearson's r correlation coefficient reported on the lower right.Fig. S2A shows that log viscosity has low correlation with inverse temperature when structural information is omitted, which means that temperature information alone is not sufficient to capture viscosity trends across different chemical species.RD PEOE VSA1 was identified as a top RDKit descriptor relevant to log viscosity in the main text (Fig. 5), but this descriptor alone has a low Pearson's r of 0.42 relative to log viscosity (see Fig. S2B).MD-derived heat of vaporization (MD HV) has a higher Pearson's r of 0.76 relative to RD PEOE VSA1 (see Fig. S2C), which suggests that the MD descriptors may be better at capturing trends in viscosity as compared to two-dimensional structural descriptors.The remaining top MD descriptors (MD Rg, MD RMSD, and MD FV) have a negative correlation to log viscosity (Fig. S2D-F) and do not correlate as strongly to log viscosity as compared to MD HV.

S2.3. Stability of MD descriptors
We evaluated the stability of MD descriptors by investigating these descriptors, simulation energies, and simulation temperatures as a function of simulation time for low and high viscosity examples.As a low viscosity example, we arbitrarily selected (Z )-1,2-Dibromoethene at a temperature of 274 K that has an experimental µ of 1.2 cP.Fig. S3A shows MD Rg, MD HV, and MD RMSD versus simulation time for (Z )-1,2-Dibromoethene, which shows that MD descriptors are well-converged after 10 ns.Fig. S3B shows the simulation potential energies, kinetic energies, and temperatures as a function of simulation time for (Z )-1,2-Dibromoethene.We observe that potential energy, kinetic energy, and temperature are well-converged in the production simulations.As a high viscosity example, we arbitrarily selected 1,2-Ethanediol at a temperature of 288 K that has an experimental µ of 26 cP.Fig. S4A shows MD Rg, MD HV, and MD RMSD versus simulation time for 1,2-Ethanediol.Similar to the low viscosity example, MD descriptors are well-converged after 10 ns of simulation time.Fig. S3B shows the simulation potential energies, kinetic energies, and temperatures as a function of simulation time for 1,2-Ethanediol, which shows good convergence of energies and temperatures across the production simulation.The results from Fig. S3 and S4 highlights that even though MD struggles in directly measure viscosities for > 5 cP, the MD descriptors extracted from short 20 ns production simulations are sufficiently converged, and they can be reliably be used for developing machine learning models to predict viscosity.

S3.1. Data splitting
Fig. S5 summarizes the data splitting workflow used to evaluate the accuracy of QSPR models in predicting viscosity.First, the dataset was split with a 80% training set and 20% testing set.A five-fold cross validation (5-CV) procedure was performed on the training set to evaluate the ability of the model to generalize on the training set.In 5-CV, the training set is partitioned into five separate sets, whereby for each of the five folds, one set is left-out as the validation set and the remaining sets are used to train the model; this procedure is repeated five times until all of the data instances are within the left-out set exactly once.Only the predictions on the left-out set are reported.After selecting the best hyperparameters from 5-CV, the model is re-trained with the entire training set and used to predict the test set.This procedure was repeated five times for different train/test splits to reduce bias that is possible when performing only a single train/test split.To evaluate the impact of performing multiple train/test splits, Fig. S6 compares the 5-CV and test set root-mean-squared error (RMSE) for descriptor-based QSPR models trained to predict viscosity when performing only a single 80:20 train:test split versus an average of five 80:20 train:test splits.The 5-CV and test set RMSE is similar when performing a single train/test split as compared to averaging the performance of multiple train/test splits.The advantage of performing multiple train/test splits is that we can rigorously evaluate the QSPR model accuracy without the concern of being "lucky" for a single train/test split.Hence, we report only the performance of QSPR models evaluated from multiple train/test splits in the main text.

S3.2. Hyperparameters
Table S2 summarizes the hyperparameter search space for descriptor-based QSPR models.After selecting the best hyperparameters from five-fold cross validation (5-CV), the model is re-trained with the entire training set and used to predict the testing.Hyperparameters for GNN QSPR were arbitrarily fixed based on ranges from Ref. [15] and summarized in Table S3.
Table S2: Hyperparameter search space for the descriptor-based QSPR models trained to predict temperature-dependent log viscosities.Each hyperparameter name is shown with the range of values in brackets.The best parameters were identified using the scikit-learn grid search method with 2D descriptors only (2D) or combinations of 2D and MD descriptors (2D and MD).No hyperparameters were tuned for MLP models.

Model
Hyperparameter Search Space Best Parameters (2D) Best Parameters (2D and MD) LGBM n estimators: {200.0,500.0}reg alpha: {0.0, 0.0, 0.1, 0. Table S3: Hyperparameters for graph-based QSPR models.All graph-based QSPR models had three graph convolutional hidden layers with node sizes of (128, 64, 32), followed by a fully connected layer of 128 nodes.A learning rate of 0.01, dropout of 0.25 and 500 epochs were used to train the models.This table summarizes hyperparameters specific to each individual models, as described previously in Ref. [15].For multilayer perceptron (MLP) models, we evaluated whether changing the hyperparameters would improve prediction accuracy by limiting the extent of overfitting that is often observed in highly parameterized neural network models.Fig. S7 compares the performance of a MLP model on a test set for an 80:20 train:test split when the early stopping parameter was either True or False or when the number of hidden layers are lowered from four to two for predicting log viscosities.The early stopping parameter prevents the MLP from overfitting by stopping the training after the validation coefficient of determination no longer improves, where the validation set is arbitrarily selected from 10% of the training data.Given the similar test set RMSE values when varying the early stopping parameter, the results suggest that attempting to reduce overfitting by turning early stopping parameter from False to True for the MLP model does not improve model performance on predicting log viscosities.We then tested whether using a shallower neural network with only two hidden layers rather than four hidden layers would improve model performance.As shown in Fig. S7, using a shallower, two-layer neural network yields similar performance to the deeper, four-layer neural network performed in the main text.Hence, the findings suggest that a shallower neural network perform similarly to a deeper neural network, but we do not observe significant improvements in the test set RMSE.We also tested a onelayer neural network with only five neurons, which yielded an RMSE of 0.34.This value is significantly higher than those obtained from the two-layer or four-layer neural networks.Therefore, neural networks with two or more layers outperform simple single-layer neural networks for this specific viscosity dataset

S3.3. Performance of QSPR Models
Table S4 shows the model score, test set RMSE, and test set mean absolute error(MAE) for descriptor-based and GNN QSPR models with and without the inclusion of MD descriptors.Overall, tree-based models have the lowest test set RMSEs, such as LGBM, XGB, and GBR, followed by GNN QSPR models, such as EdgePool, GIN, SAGPool.The inclusion of MD descriptors generally improves the QSPR models in predicting viscosity with lower test set MAEs and RMSEs (or no changes).

S3.5. Impact of noise from MD descriptors on QSPR models
We investigated the impact of noisy MD descriptors by analyzing descriptors of cyclohexane as a function of temperature.Fig. S9 shows heat of vaporization computed from MD as a function of temperature for cyclohexane between the values reported in the main text (i.e. 1 replicate of MD simulations) as compared to the average value of 5 replicates of MD simulations.The results show that running more replicates of MD simulations can yield a better monotonic trend between MD HV and temperature, which suggests that increasing the number of replicates for MD simulations can reduce the noise when computing MD descriptors.Given that the majority of the reported values are within the error of the values from the 5 replicates, it suggests that a single run of MD can obtain reasonable MD descriptor estimates without requiring extensive computation time for multiple replicates.We further evaluated the impact of noise from MD descriptors on QSPR model predictions by perturbing MD HV values for cyclohexane at T = 298.5K while keeping all other MD descriptors fixed and observing how viscosity predictions change as a function of MD HV.We perturbed MD HV by adding random numbers from a normal distribution with a mean MD HV value of 7.34 and standard deviation of 0.04.The average and standard deviation value of MD HV was estimated from the five MD simulation replicates for cyclohexane as described in Fig. S9.Fig. S10 shows the predicted log viscosity as a function of perturbed MD HV values for cyclohexane at T = 298.5K for LGBM (2D+MD) and EdgePool (MD) models that were trained with the entire viscosity dataset.The results show that adding noise to MD HV minimally impacts the prediction of log viscosities, where variations in MD HV with a standard deviation of 0.04 results in a standard deviations of ∼0.002-0.003 in predicted log viscosities for LGBM and EdgePool models.

S4.1. Data
The viscosity dataset is provided in a CSV file format under the Creative Commons Non-Commercial 4.0 International (CC-BY-NC 4.0) Attribution License.This license allows for the use of the dataset and the creation of adaptations, exclusively for non-commercial purposes, provided that appropriate credit is given.
Due to copyright restrictions, we cannot release the viscosity data from Ref. [3]; hence, we provide a subset of the dataset without data from Ref. [3].The supporting information contains a spreadsheet containing 3,582 examples of temperature-dependent viscosities with the columns described in Table S5.The available LGBM model was trained with the subset dataset described in Section S4.1 using all 2D descriptors and inverse temperature as model inputs.Fig. S11 compares the available LGBM model against one that is trained with the full dataset (see Fig. 2D of the main text).Fig. S11A compares the test set RMSE for an 80:20 train:test split for a LGBM model trained when using either the full dataset or the subset dataset.The test RMSEs for a LGBM model trained with either datasets are very similar, suggesting that the LGBM model accurately predicts the test set even when using a subset dataset.Fig. S11B shows a parity plot of predicted versus actual log viscosities for the training and testing sets for a LGBM model trained with the subset dataset.The resulting train/test R 2 and RMSE is comparable to that observed in Fig. 2D of the main text.In sum, the available LGBM model that is trained with the subset dataset demonstrates comparable accuracies as the one reported in the main text.

S4.3. Prediction of solvents related to battery electrolyte design
Predictions of 50 solvents related to battery electrolyte design for lithium metal anodes from Ref. [16] are made available using an EdgePool model trained with the entire viscosity dataset and without the inclusion of MD descriptors.The supporting information contains a spreadsheet with 650 rows of predicted log viscosities as a function of temperature from 270 K to 330 K. Table S6 summarizes the column names for this spreadsheet.
Table S6: Description of columns for the predictions of solvents from Ref. [16] when using an EdgePool model trained with the entire viscosity dataset and without the inclusion of MD descriptors.Predictions were performed for temperature ranging from 270 K to 330 K.

Figure S1 :
Figure S1: Histogram of number of temperature examples per molecule for the viscosity dataset.

Figure S2 :
Figure S2: Correlation of top descriptors to viscosity.Scatter plot of log-scale viscosities (µ) to (A) experimental inverse temperature (Inv.Temp.), (B) "MOE-like" charge van der Waal's surface area descriptor (RD PEOE VSA1), (C) heat of vaporization (MD HV), (D) radius of gyration of the molecule (MD Rg), (E) root-mean-square displacement (MD RMSD), and (F) percentage free volume (MD FV).Pearson's r correlation coefficient between log µ and descriptor is shown in the lower right hand of each plot.

Figure S3 :
Figure S3: MD stability for a low viscosity example: (Z )-1,2-Dibromoethene at a temperature of 274 K. (A) Radius of gyration of the molecule (MD Rg), heat of vaporization (MD HV), and root-mean-square displacement (MD RMSD) descriptors versus simulation time.(B) Potential energy, kinetic energy, and temperature as a function of simulation time.Ensemble average values computed using the last 10 ns of the simulation is shown as a visual guide.Molecular structure, temperature, and experimental viscosity (µ) are shown in the upper-left plot.

Figure S4 :
Figure S4: MD stability for a high viscosity example: 1,2-Ethanediol at a temperature of 288 K. (A) Radius of gyration of the molecule (MD Rg), heat of vaporization (MD HV), and root-mean-square displacement (MD RMSD) descriptors versus simulation time.(B) Potential energy, kinetic energy, and temperature as a function of simulation time.The visual guide and format is the same as Figure S3.

Figure S5 :
Figure S5: Data splitting workflow used to train and evaluate QSPR models in predicting viscosity.

Figure S6 :
Figure S6: Comparison of 5-CV and test set RMSE for a (A) single versus (B) an average of multiple 80:20 train:test splits for descriptor-based QSPR models trained to predict viscosity.The average of multiple 80:20 train:test splits is reported in the main text as Fig. 2B.The average RMSE is reported across five out-of-sample train-test splits and the RMSE uncertainty is estimated by computing the standard deviation across the splits.

Figure S7 :
Figure S7: Test RMSE between predicted and actual log viscosities for an 80:20 train:test split using a MLP model trained with either the early stopping parameter as True or False or when using a shallower neural network of two layers instead of four layers.Only 2D descriptors and external temperature were used as inputs to the model.The average test RMSE is reported, and the RMSE uncertainty is estimated by computing the standard deviation across five random, out-of-sample train-test splits.

S3. 4 .
Fig. S8 shows the percent change as a function of training size in test RMSE for LGBM and EdgePool models when including MD descriptors.Both LGBM and EdgePool models achieve at least 15% reduction in test RMSE at small data sizes with 500 training examples.As the training size increases, the percent reduction in test RMSE due to MD descriptors diminishes in magnitude and plateaus to about 10% reduction in test RMSE at 3,500 training examples.The results highlight the improvement in test RMSE across all training sizes when including just eight MD descriptors for QSPR models.

Figure S8 :
Figure S8: Percent change in test set RMSE as a function of training size for LGBM and EdgePool models when including MD descriptors.

Figure S9 :
Figure S9: Simulation-derived heat of vaporization (MD HV) versus temperature for cyclohexane using either one replicate simulation reported in the main text or five replicate simulations.For five replicate simulations, the average MD HV value is reported and the error bar is the standard deviation of MD HV values.

Figure S10 :
Figure S10: Predicted log viscosities (µ) when perturbing simulation-derived heat of vaporization (MD HV) for cylohexane at T = 298.5K for LGBM (2D+MD) and EdgePool (MD) models.Predicted log µ for the original MD HV value with no noise is shown as a red point.Fifty perturbations were performed by adding a random numbers to MD HV from a normal distribution with a mean MD HV value of 7.34 and standard deviation of 0.04.

Figure
Figure S11: (A) Test RMSE comparison for an 80:20 train:test split using a LGBM model trained with either the full dataset described in the main text or the subset dataset described in Section S4.1.The average test RMSE is reported, and the RMSE uncertainty is estimated by computing the standard deviation across five random, out-of-sample train-test splits.(B) Parity plot between predicted and actual log viscosity for an 80:20 train:test split for the LGBM model trained with the subset dataset.The statistics and visual guides are the same as Fig. 2D of the main text.

Table S4 :
Model score (Score M ), test set RMSE, and test set MAE with and without MD descriptors for descriptor and GNN QSPR models trained to predict the viscosity dataset with 80:20 train:test split.Reported Score M , RMSEs, and MAEs shows the average performance on predicting log viscosities for five random, out-of-sample train-test splits, and the uncertainties of these metrics are measured by the standard deviations of the train-test splits.This table is sorted by descending Score M (Without MD) values.

Table S5 :
Description of columns for the curated viscosity dataset.
Inverse experimental temperature in K −1 .EdgePool log(Viscosity) pred Predicted log viscosities when using an Edge-Pool model without MD descriptors.is within training True if the structure is within the viscosity dataset used to train the model.